Mapping IP addresses - SQL vs Redis
Table of Contents
Setup #
Indexing 3+ million IP addresses
In the next few blog posts I’m going to go into detail on creating an IP mapping service: IPbot. One of the first challanges was being able to efficiently query three million IP addresses and their associated metadata.
Using MaxMind’s free GeoLite data:
I got a set of some three million addresses and their associated metadata.
My first approach used SQLite for its ease of set up and ingestion but I discovered it took quite a long time to return a single record. Normally SQLite fits the bill perfectly - local, resolves to a single file, lightweight, etc.
I wanted to speed this up, a half-second to a full second seemed pretty long for making a local DB query. Searching for an in-memory solution, Redis looked like a good candidate.
For the use case of joining an IP address with its geographic place, a relational JOIN seems the easy way to go.
I found a nice solution for relating an IP addresses in Redis, a noSQL engine:
Let’s see how much faster this approach is over the naive SQLite approach.
SQLite #
It took quite a while to index the CSV data into SQLite - in the neighborhood of 15 minutes. Querying for 8.8.8.8
three different times vs Redis:
SQLite | Redis |
---|---|
9.425 | 0.004 |
5.637 | 0.004 |
0.477 | 0.004 |
{:.tablestyle} |
It seems like it takes a sec for SQLite to warm up - lets see if that’s consistant when we run 1,000 queries.
Redis #
We query first for the network / CIDR with the numeric representation of the IP address we’re looking for (lets find teh home of Google’s DNS server @ 8.8.8.8
):
from netaddr import IPNetwork
>>> int(IPNetwork('8.8.8.8').ip)
To find the network an IP belongs to we break down an IP address into it’s network and broadcast addresses (python’s netaddr
is very useful for this):
# index cidr metadata by their cidrs as the key
network = int(IPNetwork('8.8.0.0/18')[0])
broadcast = int(IPNetwork('8.8.0.0/18')[-1])
Index the metadata for a CIDR (notation for designating an IP range):
HMSET cidr:8.8.0.0/18 broadcast 134758399 network 134742016 lat 37.7510 long -97.8220 geoid 6252001
then create an index that maps a network’s broadcast address to it’s CIDR:
ZADD cidr:index 134758399 8.8.0.0/18
I did the same for the location metadata - creating an HMSET based on the geoid.
To query for an address we simpily call:
127.0.0.1:6379> ZRANGEBYSCORE cidr:index 134744072 +inf limit 0 1
1) "8.8.0.0/18"
127.0.0.1:6379> HGETALL cidr:8.8.0.0/18
1) "broadcast"
2) "134758399"
3) "lat"
4) "37.7510"
5) "geoid"
6) "6252001"
7) "long"
8) "-97.8220"
9) "network"
10) "134742016"
127.0.0.1:6379> HGETALL geoid:6252001
1) "city"
2) ""
3) "country"
4) "United States"
Indexing the data into Redis was taking a long time - a quick Google releaved that Redis has a special --pipe
mode for ingesting millions of rows of metadata: Mass Insertions. The difference was quite impresive:
Generate a Redis-protocol file:
time (python index_redis.py)
( python index_redis.py; )
281.49s user
5.42s system
98% cpu
4:50.11 total
Ingest protocol file into Redis:
time (cat migration.txt | redis-cli -a redis --pipe)
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 6788124
real 0m45.654s
user 0m0.952s
sys 0m4.684s
Wow..
I figured there might be a small speed increase but 45 seconds for ingesting a 798 M file is very impresive.
Protip: if you’re indexing a lot of data into Redis, use --pipe
.
Benchmark #
Generate 2,000 random IPv4 addresses:
ipblock = lambda: random.randrange(1, 255)
def get_ipv4_addr():
return '{}.{}.{}.{}'.format(ipblock(), ipblock(), ipblock(), ipblock())
I got a hit rate of 1717 mappable addresses (85%) since not every random address will map to a geographic location.
Running the benchmark took some time:
( python benchmark.py; ) 492.46s user 125.16s system 96% cpu 10:40.84 total
Averaged over 1,000 queries we got these averages:
SQLite | Redis |
---|---|
0.462 s | 0.010 s |
{:.tablestyle} |
That’s a pretty drastic improvement although not suprising since Redis sits in memory while SQLite relies on disk I/O, which is an order of magnitude slower. I am suprised at the half-second responses from SQLite when I’m running an SSD rather than a platter drive.
Size #
In terms of size, we get some apples and oranges comparisions of memory v.s. disk space:
-
SQLite file takes 315M of dosk space
-
Redis instance takes 576.72M of RAM
Came up with the Redis numbers using:
redis-cli -a redis INFO
We have a trade-off of low memory usage and half-second query times v.s. medium/high memory usage and millisecond query times. Seems like a worthwhile compromise.
Conclusions #
Looking at the use-cases for SQLite: SQLite use-cases, my scenario seems to fit within SQLite’s limitations but I think the long query times are more a problem with how SQL processes JOINs across large data sets.
Github Repo #
Check out the benchmark code for yourself: